IE 360 Project - Group 8 - SPRING'24¶

Contributors

  • Berat Kubilay Güngez - 2021402087
  • Ebrar Baştan - 2020402078

Table of Contents¶

  1. Introduction
  2. Related Literature
  3. Approach and Results
    • 3a. Data Inspection & Manipulation
    • 3b. Model Building
      • 3b. Part I - ARIMA Model
      • 3b. Part II - Penalized Regression Model
      • 3b. Model III - Lasso Regression & ARIMA
      • 3b. Model IV - Combined Model & Monthly Dummy Variables
      • 3b. Additional Model - Hourly Model
  4. Comparison
  5. Conclusions and Future Work
  6. Code

1. Introduction ¶

Solar energy is one of the cleanest renewable energy sources and it's produced in solar power plants. Using it as a resource is very crucial in terms of sustainability; however, the production rate of solar energy is highly influenced by external factors. Weather effects such as cloud amounts during the day, the presence of snow, and temperature affect production a lot. These changes in solar energy production can lead to fluctuations or shortages in the energy supply of the energy chain, so the chain should be protected against it. Building an efficient forecasting model is important at that point and this project aims it. Different forecasting models for Edikli GES which is located in Niğde, Turkey will be constructed and compared. The geographical coordinates of the plant are between 37.75-38.75 north latitude and 34.5-35.5 east longitude. Production data and weather data from earlier periods are given to help build the models, and the objective is to make hourly forecasts for two days ahead. Different models will be tried to build with different techniques and compared according to error metrics. The built models and the approach used will be described in detail in the Approach section.

2. Related Literature ¶

Different forecasting techniques and ways to reach problems are discussed throughout the course. Therefore, the main source is the lecture notes. Other than these, an additional source. to learn more about ARIMA is used. The website can be found in this link. These are the resources in the theoretical part. While applying these theoretical knowledge to Python, additonal knowledge is needed. The website used for this can be found here.

  • ARIMA
  • Python

3. Approach and Results ¶

First of all, packages which will be used is imported and a function to calculate error metrics is builded. Error metrics that will be used are coefficient of variation (CV), forecast bias (FBias), mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute deviation (MAD), mean absolute deviation percentage (MADP), and weighted mean absolute percentage error (WMAPE).

Imports¶

First of all, packages which will be used is imported and a function to calculate error metrics is builded. Error metrics that will be used are coefficient of variation (CV), forecast bias (FBias), mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute deviation (MAD), mean absolute deviation percentage (MADP), and weighted mean absolute percentage error (WMAPE).

In [ ]:
import pandas as pd # To work with dataset
import numpy as np # To do some math
import matplotlib.pyplot as plt # To plot the data
from sklearn.preprocessing import MinMaxScaler # To scale the data
from statsmodels.tsa.stattools import acf # To plot autocorrelations
from statsmodels.tsa.stattools import pacf # To plot partial autocorrelations
import pmdarima as pm # To find the best ARIMA model
from sklearn.linear_model import LassoCV # To choose the best alpha
from sklearn.linear_model import Lasso # To build the Lasso Model
import statsmodels.api as sm # To build the models
import seaborn as sns # To plot the data
In [ ]:
# Error metrics
def accu(actual, forecast):
    error = actual - forecast
    mean = np.mean(actual)
    sd = np.std(actual)
    CV = sd / mean # Coefficient of Variation
    FBias = np.sum(error) / np.sum(actual) # Forecast Bias
    MAPE = np.sum(np.abs(error / (actual+1))) / len(actual) # Mean Absolute Percentage Error
    RMSE = np.sqrt(np.sum(error ** 2) / len(actual)) # Root Mean Squared Error
    MAD = np.sum(np.abs(error)) / len(actual) # Mean Absolute Deviation
    MADP = np.sum(np.abs(error)) / np.sum(np.abs(actual)) # Mean Absolute Deviation Percentage
    WMAPE = MAD / mean # Weighted Mean Absolute Percentage Error
    
    # Formatting the results
    result_dict = {
        "n": len(actual),
        "mean": mean,
        "sd": sd,
        "CV": CV,
        "FBias": FBias,
        "MAPE": MAPE,
        "RMSE": RMSE,
        "MAD": MAD,
        "MADP": MADP,
        "WMAPE": WMAPE
    }
    
    return pd.DataFrame(result_dict, index=[0])

3a. Data Inspection & Manipulation ¶

Production and weather data are available in csv files. Before starting, preprocessing is necessary. Firstly, data is read from the csv files and some operations are done to prepare them.

In [ ]:
# Set the location of the data
production_data_loc = "data/production.csv"
weather_data_loc = "data/processed_weather.csv"

As can be seen below, production above 10 is rescaled to 10. This transformation is done because of real-life regulations.

In [ ]:
# Load the production data
production_df = pd.read_csv(production_data_loc)
production_df["date"] = pd.to_datetime(production_df["date"])
production_df = production_df.iloc[4:] # Remove the first 4 rows which are not in weather data
production_df = production_df.drop_duplicates() # Remove duplicates
production_df.reset_index(drop=True, inplace=True) # Reset the index
# Set the maximum production to 10 because the production is capped at 10 in the real world
production_df["production"] = production_df["production"].apply(lambda x: 10 if x > 10 else x) 

production_df
Out[ ]:
date hour production
0 2022-01-01 4 0.0
1 2022-01-01 5 0.0
2 2022-01-01 6 0.0
3 2022-01-01 7 0.0
4 2022-01-01 8 3.4
... ... ... ...
20967 2024-05-23 19 0.0
20968 2024-05-23 20 0.0
20969 2024-05-23 21 0.0
20970 2024-05-23 22 0.0
20971 2024-05-23 23 0.0

20972 rows × 3 columns

In [ ]:
# Load the weather data
weather_df = pd.read_csv(weather_data_loc) 
weather_df["date"] = pd.to_datetime(weather_df["date"], format='%Y-%m-%d') # Convert the date to datetime
weather_df = weather_df.sort_values(by=['date', 'hour']) # Sort the data by date and hour
weather_df.reset_index(drop=True, inplace=True) # Reset the index

weather_df
Out[ ]:
date hour lat lon dswrf_surface tcdc_low.cloud.layer tcdc_middle.cloud.layer tcdc_high.cloud.layer tcdc_entire.atmosphere uswrf_top_of_atmosphere csnow_surface dlwrf_surface uswrf_surface tmp_surface
0 2022-01-01 4 38.00 34.75 0.000 3.0 5.0 7.2 15.1 0.000 0.0 225.079 0.000 269.004
1 2022-01-01 4 38.75 35.50 0.000 1.4 0.1 5.0 6.7 0.000 0.0 232.479 0.000 271.204
2 2022-01-01 4 38.00 35.25 0.000 0.0 5.1 4.1 10.6 0.000 0.0 209.479 0.000 265.004
3 2022-01-01 4 38.50 34.75 0.000 3.0 0.0 5.0 11.4 0.000 0.0 229.579 0.000 270.804
4 2022-01-01 4 38.75 34.50 0.000 5.0 0.0 5.0 14.7 0.000 0.0 241.779 0.000 271.404
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
527245 2024-05-28 21 38.25 34.50 278.144 51.4 6.1 0.0 52.2 145.312 0.0 314.501 54.592 283.700
527246 2024-05-28 21 38.00 34.75 340.848 0.7 0.0 0.0 0.7 101.184 0.0 293.701 69.440 284.700
527247 2024-05-28 21 38.75 34.75 284.224 18.1 0.0 0.7 19.2 140.672 0.0 330.601 61.568 287.400
527248 2024-05-28 21 38.25 35.25 290.560 35.1 1.1 0.0 35.6 128.304 0.0 326.601 61.632 288.300
527249 2024-05-28 21 38.25 34.75 267.664 45.7 0.9 0.0 46.3 152.784 0.0 326.901 54.960 285.800

527250 rows × 14 columns

In [ ]:
# Pivot the weather data
columns_to_pivot = weather_df.columns[4:] # Get the columns to pivot

for col in columns_to_pivot:
    weather_df[f'{col}_identifier'] = col + "_" + weather_df['lat'].astype(str) + "_" + weather_df['lon'].astype(str)

pivoted_dfs = []
for col in columns_to_pivot:
    pivoted_df = pd.pivot(
        weather_df,
        index=['date', 'hour'],
        columns=f'{col}_identifier',
        values=col
    )
    pivoted_df.columns.name = None 
    pivoted_df.reset_index(inplace=True)  
    pivoted_dfs.append(pivoted_df)

result_df = pivoted_dfs[0]
for df in pivoted_dfs[1:]:
    result_df = result_df.merge(df, on=['date', 'hour'], how='outer') 
    
df = result_df.iloc[:, :252] # Remove the identifier columns

df
Out[ ]:
date hour dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 ... tmp_surface_38.5_34.5 tmp_surface_38.5_34.75 tmp_surface_38.5_35.0 tmp_surface_38.5_35.25 tmp_surface_38.5_35.5 tmp_surface_38.75_34.5 tmp_surface_38.75_34.75 tmp_surface_38.75_35.0 tmp_surface_38.75_35.25 tmp_surface_38.75_35.5
0 2022-01-01 4 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 271.404 270.804 270.504 271.204 262.204 271.404 271.404 270.904 270.304 271.204
1 2022-01-01 5 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 271.304 270.804 270.604 271.104 262.004 271.304 271.204 270.804 270.104 271.104
2 2022-01-01 6 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 271.331 270.831 270.531 270.931 262.031 271.231 271.131 270.731 270.031 271.131
3 2022-01-01 7 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 271.217 270.817 270.517 270.917 262.317 271.217 271.117 270.717 270.017 271.217
4 2022-01-01 8 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 271.194 270.894 270.594 270.894 263.394 271.194 271.094 270.594 269.894 270.994
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21085 2024-05-28 17 669.480 681.480 691.380 569.620 489.000 699.540 697.620 692.880 ... 294.225 292.725 296.525 292.825 289.125 296.725 297.825 296.925 296.725 297.025
21086 2024-05-28 18 582.380 582.560 596.280 511.620 438.240 604.260 602.320 598.100 ... 293.000 291.500 292.100 292.200 288.300 294.500 294.200 293.100 293.700 294.200
21087 2024-05-28 19 487.648 487.184 496.464 422.736 361.488 503.520 502.560 498.960 ... 290.453 289.553 289.553 288.753 285.953 292.053 291.753 290.453 290.153 290.153
21088 2024-05-28 20 397.200 396.592 403.648 343.472 293.616 409.856 409.008 405.776 ... 286.100 285.800 286.300 285.700 282.200 288.500 288.100 288.200 289.900 286.800
21089 2024-05-28 21 330.992 330.496 336.368 286.224 244.672 341.552 340.848 338.160 ... 285.300 285.000 285.800 285.000 281.300 287.600 287.400 286.300 289.000 287.300

21090 rows × 252 columns

On the 12th of May 2024, production data is given as 0 in the data. Since this is probably an error in the data, this date's data is dropped.

In [ ]:
date = pd.to_datetime("2024-05-12")

specific_production = production_df[production_df["date"] == date]

plt.figure(figsize=(8, 6))
plt.plot(specific_production["hour"], specific_production["production"], label="Production")
plt.xlabel("Hour")
plt.ylabel("Production")
plt.title("Production vs Hour")
plt.legend()
plt.show()

df = df[df["date"] != date]
production_df = production_df[production_df["date"] != date]
No description has been provided for this image

Production and weather data frames are combined below.

In [ ]:
# Get the data that matches with the production data
end_date = production_df["date"].iloc[-1]
end_hour = production_df["hour"].iloc[-1]

df_copy = df.copy() # To keep the original data and in case we need it later

cut_off_index = df[(df["date"] == end_date) & (df["hour"] == end_hour)].index.min() 
df = df.loc[:cut_off_index]
df = pd.merge(df, production_df, on=["date", "hour"], how = "inner") # Merge the production data with the weather data
df.ffill(inplace=True) # Fill the missing values with the previous value if there is any

df
Out[ ]:
date hour dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 ... tmp_surface_38.5_34.75 tmp_surface_38.5_35.0 tmp_surface_38.5_35.25 tmp_surface_38.5_35.5 tmp_surface_38.75_34.5 tmp_surface_38.75_34.75 tmp_surface_38.75_35.0 tmp_surface_38.75_35.25 tmp_surface_38.75_35.5 production
0 2022-01-01 4 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 270.804 270.504 271.204 262.204 271.404 271.404 270.904 270.304 271.204 0.0
1 2022-01-01 5 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 270.804 270.604 271.104 262.004 271.304 271.204 270.804 270.104 271.104 0.0
2 2022-01-01 6 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 270.831 270.531 270.931 262.031 271.231 271.131 270.731 270.031 271.131 0.0
3 2022-01-01 7 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 270.817 270.517 270.917 262.317 271.217 271.117 270.717 270.017 271.217 0.0
4 2022-01-01 8 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 270.894 270.594 270.894 263.394 271.194 271.094 270.594 269.894 270.994 3.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20943 2024-05-23 19 417.232 415.040 470.176 461.136 317.344 407.552 459.744 487.328 ... 288.409 291.109 291.209 287.109 288.709 291.709 294.209 294.809 294.809 0.0
20944 2024-05-23 20 339.168 337.536 381.040 372.608 258.112 328.048 371.328 393.680 ... 287.100 287.000 289.200 284.100 289.700 289.500 288.100 290.200 291.800 0.0
20945 2024-05-23 21 282.640 281.280 317.520 310.512 215.088 273.360 309.440 328.064 ... 287.200 287.000 285.700 284.400 290.300 290.000 288.300 287.800 287.300 0.0
20946 2024-05-23 22 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 288.700 287.700 285.400 284.100 289.300 289.200 287.400 287.600 287.700 0.0
20947 2024-05-23 23 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 286.934 287.234 285.934 283.034 288.034 288.134 287.734 287.434 286.934 0.0

20948 rows × 253 columns

To understand the production data before any operation, autocorrelation within itself is plotted.

In [ ]:
# Plot the autocorrelation of the production data
acf_values = acf(df["production"], nlags=24*10)
plt.figure(figsize=(20, 5))
plt.plot(range(len(acf_values)), acf_values, marker='o', linestyle='-')
for i in range(1, 11):
    plt.axvline(x=24*i, color='red', linestyle='--')
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

As it can be seen from above, production data has high autocorrelation and it seems autocorrelation has a seasonal pattern. Lags which are shown with red lines have the peaks. It is lag 24 and its multiplicatives. It's expected to have high autocorrelation at the same hour in every day because of the weather conditions such as temperature and sunlight. Thus, it's a reasonable result.

In [ ]:
# Let us inspect a single location to understand the data
specific_loc = [col for col in df.columns if '38.0_34.5' in col]
extra_columns = ["production"]
df_filtered = pd.concat([df[extra_columns],df[specific_loc]], axis=1)

df_filtered
Out[ ]:
production dswrf_surface_38.0_34.5 tcdc_low.cloud.layer_38.0_34.5 tcdc_middle.cloud.layer_38.0_34.5 tcdc_high.cloud.layer_38.0_34.5 tcdc_entire.atmosphere_38.0_34.5 uswrf_top_of_atmosphere_38.0_34.5 csnow_surface_38.0_34.5 dlwrf_surface_38.0_34.5 uswrf_surface_38.0_34.5 tmp_surface_38.0_34.5
0 0.0 0.000 0.0 5.0 7.6 14.0 0.000 0.0 226.179 0.000 268.204
1 0.0 0.000 1.3 4.9 23.6 29.1 0.000 0.0 226.026 0.000 268.104
2 0.0 0.000 1.7 4.1 30.7 35.6 0.000 0.0 225.800 0.000 267.931
3 0.0 0.000 1.9 4.3 40.6 46.0 0.000 0.0 226.148 0.000 268.117
4 3.4 0.000 2.1 4.5 40.0 46.6 0.000 0.0 226.145 0.000 267.994
... ... ... ... ... ... ... ... ... ... ... ...
20943 0.0 407.552 1.8 44.7 23.7 45.6 206.928 0.0 341.392 74.800 292.409
20944 0.0 328.048 3.9 47.4 36.4 54.7 175.312 0.0 344.946 60.272 289.700
20945 0.0 273.360 8.8 54.3 43.5 62.2 146.080 0.0 347.346 50.240 287.900
20946 0.0 0.000 0.0 2.6 87.2 87.2 0.000 0.0 301.764 0.000 286.600
20947 0.0 0.000 0.0 1.3 43.6 43.6 0.000 0.0 295.806 0.000 285.934

20948 rows × 11 columns

To understand which variables affect the production amount the most, correlation matrix is printed.

In [ ]:
# Print the correlation matrix
df_filtered.corr()
Out[ ]:
production dswrf_surface_38.0_34.5 tcdc_low.cloud.layer_38.0_34.5 tcdc_middle.cloud.layer_38.0_34.5 tcdc_high.cloud.layer_38.0_34.5 tcdc_entire.atmosphere_38.0_34.5 uswrf_top_of_atmosphere_38.0_34.5 csnow_surface_38.0_34.5 dlwrf_surface_38.0_34.5 uswrf_surface_38.0_34.5 tmp_surface_38.0_34.5
production 1.000000 0.602806 -0.169203 -0.218731 -0.138377 -0.217157 0.358575 -0.100748 0.020937 0.516562 0.531894
dswrf_surface_38.0_34.5 0.602806 1.000000 -0.145427 -0.189573 -0.160620 -0.199366 0.649627 -0.084242 0.214118 0.841934 0.698949
tcdc_low.cloud.layer_38.0_34.5 -0.169203 -0.145427 1.000000 0.486816 0.108026 0.607934 0.281135 0.544634 0.188494 -0.033522 -0.275991
tcdc_middle.cloud.layer_38.0_34.5 -0.218731 -0.189573 0.486816 1.000000 0.492768 0.714380 0.184370 0.294590 0.261409 -0.135656 -0.206837
tcdc_high.cloud.layer_38.0_34.5 -0.138377 -0.160620 0.108026 0.492768 1.000000 0.718796 0.048605 0.059755 0.056998 -0.140608 -0.183636
tcdc_entire.atmosphere_38.0_34.5 -0.217157 -0.199366 0.607934 0.714380 0.718796 1.000000 0.221062 0.330721 0.138677 -0.107047 -0.314578
uswrf_top_of_atmosphere_38.0_34.5 0.358575 0.649627 0.281135 0.184370 0.048605 0.221062 1.000000 0.115868 0.248971 0.720640 0.331322
csnow_surface_38.0_34.5 -0.100748 -0.084242 0.544634 0.294590 0.059755 0.330721 0.115868 1.000000 -0.008231 0.023359 -0.241555
dlwrf_surface_38.0_34.5 0.020937 0.214118 0.188494 0.261409 0.056998 0.138677 0.248971 -0.008231 1.000000 -0.002359 0.673935
uswrf_surface_38.0_34.5 0.516562 0.841934 -0.033522 -0.135656 -0.140608 -0.107047 0.720640 0.023359 -0.002359 1.000000 0.398708
tmp_surface_38.0_34.5 0.531894 0.698949 -0.275991 -0.206837 -0.183636 -0.314578 0.331322 -0.241555 0.673935 0.398708 1.000000

The production amount is correlated with dswrf_surface variable the most. Uswrf_surface, tmp_surface, and uswrf_top_atmosphere are also positively correlated with it. Cloud layer variables ( low, middle and high) and tcdc_entire.atmosphere are negatively correlated with it, but their values are lower than the positive ones in absolute sense. Other than these, cloud variables have high (more than 0.6) correlation with tcdc_entire.atmosphere and temperature variable has high correlation with radiation variables dswrf_surface and dlwrf_surface. These high correlations can lead to multicollinearity in the model. In model building process, this needs to be considered.

In [ ]:
# Plot the data compared to the production
plt_lst = df_filtered.columns[1:]

for col in plt_lst:
    y = df_filtered["production"]
    x = df_filtered[col]

    coeffs = np.polyfit(x, y, 2)
    poly_func = np.poly1d(coeffs)
    
    x_fit = np.linspace(min(x), max(x), 100)  
    y_fit = poly_func(x_fit)  

    plt.figure(figsize=(6, 6))
    plt.plot(x, y, 'o', label='Original Data', markersize=1)  
    plt.plot(x_fit, y_fit, '--', label='Quadratic Fit', linewidth=10)  
    plt.title(f"Production vs {col}")
    plt.xlabel(col)
    plt.ylabel("Production")
    plt.xlim([min(x) - 1, max(x) + 1])  # Adjust as needed
    plt.ylim([min(y) - 1, max(y) + 1])  # Adjust as needed
    plt.legend(loc ="upper right")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
/var/folders/mk/yzmswp8x76g_k99y3jccgr0c0000gn/T/ipykernel_4044/3754004380.py:8: RankWarning: Polyfit may be poorly conditioned
  coeffs = np.polyfit(x, y, 2)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Pairwise relations with production data are plotted to observe relation patterns to build the model, but a clear pattern could not be found. Even though the quadratic fit function finds patterns in these, they don't seem to be observable.

In [ ]:
# Plot pairwise relationships in the dataset
pair_plot = sns.pairplot(df_filtered, kind='scatter', diag_kind='kde', markers='o', plot_kws={'alpha':0.5})
plt.show()
No description has been provided for this image

The production data is scaled to have values between 0 and 10 as a preprocessing step. This is done because of production data is between to 0 to 10.

In [ ]:
# Scale the data
scaler = MinMaxScaler(feature_range=(0, 10)) # Scale the data between 0 and 10
columns_to_scale = df.columns[2:] # Do not scale the date and hour columns

df.loc[:, columns_to_scale] = scaler.fit_transform(df.loc[:, columns_to_scale])

Relations were not clear in the plots above, so the data is decomposed to see and understand trend, seasonality, and remainder components. As can be seen, since data is too long, only the last 1000 data points are plotted.

In [ ]:
production = df["production"]

production

# Decompose the data to understand the trend, seasonal and remainder components
decomposition = sm.tsa.seasonal_decompose(production, model='additive', period=24) # Period is 24 because the seasonality seems to be hourly

trend = decomposition.trend
seasonal = decomposition.seasonal
remainder = decomposition.resid

plt.figure(figsize=(16, 8))
plt.plot(production[-1000:], label='Original')
plt.legend(loc='best')
plt.title('Production')
plt.show()

plt.figure(figsize=(16, 8))
plt.plot(trend[-1000:], label='Trend', color='red')
plt.legend(loc='best')
plt.title('Trend Component')
plt.show()

plt.figure(figsize=(16, 8))
plt.plot(seasonal[-1000:], label='Seasonal', color='red')
plt.legend(loc='best')
plt.title('Seasonal Component')
plt.show()

plt.figure(figsize=(16, 8))
plt.plot(remainder[-1000:], label='Remainder', color='red')
plt.legend(loc='best')
plt.title('Remainder Component')
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

The data is definitely seasonal as it's observed beforehand. It has a somewhat changing trend, but it is not that clear. Also, the remainder component seems to have a stable mean but some points have too low values than others which may affect variance. It does not seem to have a complete white noise series pattern.

3b. Model Building ¶

The production data from previous years are given, and they will be used for training purposes. 2022 and 2023 data is used for training and test data starts at February 2024. The data is splitted as test and train groups to observe performance before deciding which model to use and submitting forecasts.

In [ ]:
train_test_split_date = pd.to_datetime("2024-02-01")

df_train= df[df['date'] < train_test_split_date] # 2022 and 2023 data is used for training

df_train
Out[ ]:
date hour dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 ... tmp_surface_38.5_34.75 tmp_surface_38.5_35.0 tmp_surface_38.5_35.25 tmp_surface_38.5_35.5 tmp_surface_38.75_34.5 tmp_surface_38.75_34.75 tmp_surface_38.75_35.0 tmp_surface_38.75_35.25 tmp_surface_38.75_35.5 production
0 2022-01-01 4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.792751 2.563750 2.560073 2.381561 2.866957 2.738521 2.454569 2.315101 2.207942 0.0
1 2022-01-01 5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.792751 2.576989 2.546334 2.355981 2.854968 2.714355 2.441626 2.290149 2.194867 0.0
2 2022-01-01 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.796341 2.567324 2.522566 2.359434 2.846216 2.705534 2.432177 2.281041 2.198397 0.0
3 2022-01-01 7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.794479 2.565471 2.520642 2.396015 2.844537 2.703842 2.430365 2.279294 2.209641 0.0
4 2022-01-01 8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.804718 2.575665 2.517483 2.533766 2.841780 2.701063 2.414445 2.263948 2.180485 3.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18255 2024-01-31 19 1.576039 1.070560 1.187324 0.862606 1.083381 1.712690 1.449336 1.104238 ... 2.620830 2.458758 2.615989 2.344086 2.939731 2.715201 2.468418 2.428261 2.208857 0.0
18256 2024-01-31 20 1.260746 0.856532 0.950066 0.690044 0.866578 1.370194 1.159552 0.883349 ... 2.626281 2.450947 2.607883 2.349330 2.896690 2.683905 2.460782 2.408424 2.201143 0.0
18257 2024-01-31 21 1.050692 0.713777 0.791618 0.575071 0.722254 1.141723 0.966154 0.736089 ... 2.612586 2.437310 2.593732 2.323365 2.884341 2.683543 2.447450 2.395573 2.200750 0.0
18258 2024-01-31 22 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.599689 2.424467 2.580406 2.336539 2.884701 2.671822 2.434895 2.395948 2.201143 0.0
18259 2024-01-31 23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.605805 2.404078 2.572987 2.355213 2.878227 2.677380 2.440849 2.389211 2.207157 0.0

18260 rows × 253 columns

In [ ]:
df_test = df[df['date'] >= train_test_split_date] # 2024 data is used for testing

df_test
Out[ ]:
date hour dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 ... tmp_surface_38.5_34.75 tmp_surface_38.5_35.0 tmp_surface_38.5_35.25 tmp_surface_38.5_35.5 tmp_surface_38.75_34.5 tmp_surface_38.75_34.75 tmp_surface_38.75_35.0 tmp_surface_38.75_35.25 tmp_surface_38.75_35.5 production
18260 2024-02-01 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.595966 2.394280 2.562820 2.332958 2.881344 2.680522 2.431271 2.404931 2.184407 0.0
18261 2024-02-01 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.587456 2.399047 2.540289 2.324772 2.873671 2.672789 2.435931 2.384470 2.162964 0.0
18262 2024-02-01 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.594902 2.393221 2.520505 2.319145 2.868395 2.679555 2.430236 2.378980 2.157211 0.0
18263 2024-02-01 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.577085 2.375480 2.502095 2.276425 2.852330 2.675447 2.412892 2.349785 2.139691 0.0
18264 2024-02-01 4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.419923 2.218986 2.477091 2.227566 2.830510 2.677622 2.415221 2.339555 2.115895 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20943 2024-05-23 19 4.417397 4.350159 4.862816 4.725335 3.333375 4.279121 4.806142 5.068056 ... 5.133561 5.291805 5.308503 5.566996 4.941673 5.192001 5.471007 5.372417 5.294248 0.0
20944 2024-05-23 20 3.590903 3.537817 3.940923 3.818174 2.711204 3.444363 3.881845 4.094146 ... 4.959513 4.747782 5.032492 5.182134 5.060485 4.925085 4.680300 4.797385 4.900828 0.0
20945 2024-05-23 21 2.992419 2.948180 3.283965 3.181867 2.259280 2.870162 3.234871 3.411761 ... 4.972809 4.747782 4.551637 5.220505 5.132420 4.985500 4.706187 4.497954 4.312462 0.0
20946 2024-05-23 22 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 5.172253 4.840461 4.510421 5.182134 5.012529 4.888835 4.589697 4.473001 4.364761 0.0
20947 2024-05-23 23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 4.937441 4.778763 4.583786 5.045789 4.860746 4.760029 4.632928 4.452291 4.264608 0.0

2688 rows × 253 columns

3b. Part I - ARIMA Model ¶

ARIMA is one of the forecasting techniques discussed in class and it works well, so it'll be the first model. Before trying to find ARIMA coefficients, differencing is done to stationarize the data. This will also decrease the time to find a proper ARIMA variable pair. Differencing will be done with lag 24 which corresponds to a day before at the same hour.

In [ ]:
# Find the best ARIMA model of differenced data 
# This staionarizes the data and makes it easier to model
diff_24_production_test = [df_train["production"][i] - df_train["production"][i-24] for i in range(24, len(df_train["production"]))]

acf_values = acf(diff_24_production_test, nlags=24*10)
plt.figure(figsize=(20, 5))
plt.plot(range(len(acf_values)), acf_values, marker='o', linestyle='-')
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

Except from the first few observed data and the point where autocorrelation reaches -0.4, data have autocorrelation near 0 after differencing. The seasonal pattern is gone as it can be seen.

In [ ]:
acf_values = pacf(diff_24_production_test, nlags=24*10)
plt.figure(figsize=(20, 5))
plt.plot(range(len(acf_values)), acf_values, marker='o', linestyle='-')
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

Partial autocorrelations seem to have a somewhat dampening seasonal pattern.

Autocorrelation and partial autocorrelation plots of differenced data indicate an ARIMA(2,0,0) model can work well. To find this model and observe other options, auto_arima is used.

In [ ]:
# Find the best ARIMA model by using auto_arima
arima_model = pm.auto_arima(
    diff_24_production_test,
    # Limit differencing to 1 since the data is already differenced
    start_d=0, 
    max_d=1, 
    seasonal=False, # Seasonal is False because the data is already differenced
    stepwise=True,   
    trace=True,      
    error_action='ignore',  
    suppress_warnings=True, 
    max_order=None,     
    information_criterion='aic', # Use AIC to find the best model
)

best_arima = arima_model.fit(diff_24_production_test) # Fit the best ARIMA model
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=62877.150, Time=0.99 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=73947.896, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=62902.550, Time=0.20 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=66633.340, Time=0.28 sec
 ARIMA(1,0,2)(0,0,0)[0]             : AIC=62879.114, Time=0.54 sec
 ARIMA(2,0,1)(0,0,0)[0]             : AIC=62877.011, Time=0.99 sec
 ARIMA(1,0,1)(0,0,0)[0]             : AIC=62892.816, Time=0.33 sec
 ARIMA(2,0,0)(0,0,0)[0]             : AIC=62891.773, Time=0.31 sec
 ARIMA(3,0,1)(0,0,0)[0]             : AIC=62877.079, Time=1.04 sec
 ARIMA(3,0,0)(0,0,0)[0]             : AIC=62877.777, Time=0.40 sec
 ARIMA(3,0,2)(0,0,0)[0]             : AIC=62878.949, Time=1.38 sec
 ARIMA(2,0,1)(0,0,0)[0] intercept   : AIC=62879.010, Time=3.56 sec

Best model:  ARIMA(2,0,1)(0,0,0)[0]          
Total fit time: 10.085 seconds

ARIMA coefficients are found with the auto_arima function. This algorithm does not guarantee an optimal therefore this result may not be the best p,d,q choice. But compared to our guess ARIMA(2,0,0), ARIMA(2,0,1)(which corresponds to 2 autoregressive terms and 1 moving average term) seems to be better in terms of AIC values. Therefore this result will be used in the following applications.

In [ ]:
best_arima.summary()
Out[ ]:
SARIMAX Results
Dep. Variable: y No. Observations: 18236
Model: SARIMAX(2, 0, 1) Log Likelihood -31434.506
Date: Wed, 05 Jun 2024 AIC 62877.011
Time: 12:59:48 BIC 62908.256
Sample: 0 HQIC 62887.279
- 18236
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.0167 0.078 0.214 0.831 -0.136 0.170
ar.L2 0.4600 0.051 8.983 0.000 0.360 0.560
ma.L1 0.6375 0.079 8.061 0.000 0.482 0.792
sigma2 1.8397 0.008 225.967 0.000 1.824 1.856
Ljung-Box (L1) (Q): 0.20 Jarque-Bera (JB): 66741.16
Prob(Q): 0.65 Prob(JB): 0.00
Heteroskedasticity (H): 0.69 Skew: -0.07
Prob(H) (two-sided): 0.00 Kurtosis: 12.37


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Other performance measures BIC and HQIC give 60331.923 and 30311.080 respectively.

In [ ]:
# Plot the results of the ARIMA model
forecasted_values = list(df["production"][:24])

for diff in list(best_arima.fittedvalues()):
    new_value = forecasted_values[-24] + diff
    forecasted_values.append(new_value)

forecasted_values = [max(0, i) for i in forecasted_values]

plt.figure(figsize=(12, 8))
plt.plot(range(len(df_train["production"][-750:])),forecasted_values[-750:], label="Forecast", c = "red")
plt.plot(range(len(df_train["production"][-750:])),df_train["production"][-750:], label="Real", c="blue")
plt.title("Fitted Values vs Real Production")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Forecasted and real production amounts are plotted together to visualize the situation better. Red ones represent forecasted ones and blue ones represent real values. At most of the time points, the ARIMA model overpredicts the production values. At some point, the gap seems to be too large, so the ARIMA model did not give good predictions as wanted in the train set but since this model is a starting point, this model is used to improve/build other models as well.

In [ ]:
accu(df_train["production"], forecasted_values) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 18260 2.565885 3.688927 1.437682 0.075443 0.213338 1.478055 0.808813 0.315218 0.315218

Error metrics are calculated and the results can be seen above. Although it seems the model overpredicts, the FBias value is positive, which means the model underpredicts. Probably in the part that is not plotted, ARIMA underpredicts. MAPE, MADP, and WMAPE are not good but are not that bad, but the MAD value is too large. It will be clear once they are compared with other models.

In [ ]:
# Test the ARIMA model
df_train_last_24 = df_train["production"].iloc[-24:].tolist()
add_values = df_train_last_24 + df_test["production"].iloc[:-24].tolist()
forecast_ = arima_model.predict(n_periods=len(df_test))

arima_test_pred = [add_values[i] + forecast_[i] for i in range(len(forecast_))]
In [ ]:
# Test Period
plt.figure(figsize=(12, 8))
plt.plot(range(len(df_test["production"].iloc[-750:])), arima_test_pred[-750:], label="Forecast", c="red")
plt.plot(range(len(df_test["production"].iloc[-750:])), df_test["production"].iloc[-750:], label="Real", c="blue")
plt.title("Forecast vs Real Production")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Real and forecasted values of test data are plotted. Forecasted and real values are represented by red and blue lines, respectively. It seems there are more gaps at some time points between these two than train set results. However, it predicts some time points better. It will be more clear after checking error metrices.

In [ ]:
accu(df_test["production"], arima_test_pred) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 2688 2.600301 3.496188 1.344532 0.002229 0.198583 1.919275 0.947456 0.364364 0.364364

Train data predictions have higher error metric values than test data. The model's underprediction rate is lower in this case, but error values are more than desired. Therefore, the ARIMA model will not be used in the forecasting procedure and new models will be tried to be built. Also, one other reason why ARIMA is not used directly is the lack of weather regressors.

In [ ]:
# Save the results
forecasted_train = {"Arima": forecasted_values}
forecasted_test = {"Arima": arima_test_pred}

3b. Model II - Penalized Regression Model ¶

As it's shown previously, there are a lot of variables which affect production amount at different rates. After plotting their relations with the production data, clear patterns could not be found. Therefore, letting the machine decide which variables to include the model seems a logical way. In this part, to achieve this, a penalized regression model will be built and how it works will be observed.

In [ ]:
# Lasso Regression
features_train = df_train.copy()
features_train.drop(columns=["date", "hour", "production"], inplace=True)

lst = weather_df.columns[4:-10]

# Include the following features
# Max, Min, Mean, Sum, Median, Std, Var
for feature in lst:
    feature_col = df_train.columns[df_train.columns.str.contains(feature)]
    features_train[f"max_{feature}"] = df_train[feature_col].max(axis=1)
    features_train[f"min_{feature}"] = df_train[feature_col].min(axis=1)
    features_train[f"mean_{feature}"] = df_train[feature_col].mean(axis=1)
    features_train[f"sum_{feature}"] = df_train[feature_col].sum(axis=1)
    features_train[f"median_{feature}"] = df_train[feature_col].median(axis=1)
    features_train[f"std_{feature}"] = df_train[feature_col].std(axis=1)
    features_train[f"var_{feature}"] = df_train[feature_col].var(axis=1)

features_train
Out[ ]:
dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 dswrf_surface_38.0_35.25 dswrf_surface_38.0_35.5 ... median_uswrf_surface std_uswrf_surface var_uswrf_surface max_tmp_surface min_tmp_surface mean_tmp_surface sum_tmp_surface median_tmp_surface std_tmp_surface var_tmp_surface
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.970018 1.938558 2.477303 61.932586 2.524168 0.283675 0.080472
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.957112 1.909188 2.461961 61.549024 2.512177 0.287704 0.082774
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.960597 1.883783 2.452826 61.320651 2.487963 0.290394 0.084329
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.945884 1.881728 2.448251 61.206282 2.472008 0.287081 0.082416
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.942915 1.863665 2.448233 61.205814 2.486316 0.291087 0.084732
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18255 1.576039 1.070560 1.187324 0.862606 1.083381 1.712690 1.449336 1.104238 1.268847 1.554716 ... 0.884301 0.240753 0.057962 2.939731 2.208857 2.645475 66.136863 2.677194 0.193474 0.037432
18256 1.260746 0.856532 0.950066 0.690044 0.866578 1.370194 1.159552 0.883349 1.015078 1.243814 ... 0.707501 0.192597 0.037094 2.896690 1.827368 2.540076 63.501900 2.607883 0.257617 0.066367
18257 1.050692 0.713777 0.791618 0.575071 0.722254 1.141723 0.966154 0.736089 0.845761 1.036477 ... 0.589434 0.160505 0.025762 2.884341 1.856992 2.493482 62.337046 2.543567 0.255527 0.065294
18258 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.884701 1.860687 2.484138 62.103449 2.462910 0.234975 0.055213
18259 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.878227 1.796942 2.464852 61.621300 2.440849 0.243719 0.059399

18260 rows × 320 columns

Maximum, minimum, mean, sum, median, standard deviation and variance of variables can be found above.

In [ ]:
# Find the optimal lambda
lasso_cv = LassoCV(cv=5) # 5 fold cross validation
lasso_cv.fit(features_train, df_train["production"]) # Finds the best alpha

optimal_lambda = lasso_cv.alpha_
optimal_lambda
Out[ ]:
0.16421425589191282

Optimal lambda value is calculated with cross validation in the train data and found as 0.16756. Note that this value can change in different runs.

In [ ]:
# Train the Lasso model
lasso = Lasso(optimal_lambda)

lasso.fit(features_train, df_train["production"])
Out[ ]:
Lasso(alpha=0.16421425589191282)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.16421425589191282)

Lasso model is built with this alpha value.

In [ ]:
# Plot the results
lasso_result = lasso.predict(features_train)
fitted = [max(0, i) for i in lasso_result]

plt.figure(figsize=(12, 8))
plt.plot(range(len(fitted[-750:])), df_train["production"][-750:], label="Real Values", color="blue")
plt.plot(range(len(fitted[-750:])), fitted[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Forecasted and real values are plotted together for visualization and as the previous model, red line represents predcited values while blue one represents real ones. There are huge gaps between these two plots. At points where production amount is higher the model underpredicts the values, and while the production is low (zero) the model overpredicts it. It will work worse than ARIMA model but let's see test data and error metrices.

In [ ]:
accu(df_train["production"], fitted) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 18260 2.565885 3.688927 1.437682 -0.035382 0.703112 2.137464 1.499716 0.584483 0.584483

Error metric values are higher than ARIMA model and they are far beyond desired level. Lasso regression model overpredicts the production amount as opposed to ARIMA model.

In [ ]:
# Features for the test data
features_test = df_test.copy()
features_test.drop(columns=["date", "hour", "production"], inplace=True)

lst = weather_df.columns[4:-10]

for feature in lst:
    feature_col = df_test.columns[df_test.columns.str.contains(feature)]
    features_test[f"max_{feature}"] = df_test[feature_col].max(axis=1)
    features_test[f"min_{feature}"] = df_test[feature_col].min(axis=1)
    features_test[f"mean_{feature}"] = df_test[feature_col].mean(axis=1)
    features_test[f"sum_{feature}"] = df_test[feature_col].sum(axis=1)
    features_test[f"median_{feature}"] = df_test[feature_col].median(axis=1)
    features_test[f"std_{feature}"] = df_test[feature_col].std(axis=1)
    features_test[f"var_{feature}"] = df_test[feature_col].var(axis=1)

features_test
Out[ ]:
dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 dswrf_surface_38.0_35.25 dswrf_surface_38.0_35.5 ... median_uswrf_surface std_uswrf_surface var_uswrf_surface max_tmp_surface min_tmp_surface mean_tmp_surface sum_tmp_surface median_tmp_surface std_tmp_surface var_tmp_surface
18260 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.881344 1.744429 2.433000 60.824993 2.427178 0.276363 0.076376
18261 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.873671 1.633083 2.419346 60.483651 2.422292 0.296542 0.087937
18262 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.868395 1.641504 2.418919 60.472971 2.420684 0.301063 0.090639
18263 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.852330 1.626206 2.406809 60.170233 2.412892 0.308227 0.095004
18264 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.830510 1.586611 2.335391 58.384774 2.339555 0.291150 0.084768
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20943 4.417397 4.350159 4.862816 4.725335 3.333375 4.279121 4.806142 5.068056 4.450029 4.830602 ... 1.570803 0.334526 0.111908 5.692908 4.556321 5.291818 132.295448 5.294248 0.263889 0.069638
20944 3.590903 3.537817 3.940923 3.818174 2.711204 3.444363 3.881845 4.094146 3.609129 3.911386 ... 1.275921 0.273807 0.074970 5.255237 4.522519 4.937514 123.437858 4.915365 0.174438 0.030428
20945 2.992419 2.948180 3.283965 3.181867 2.259280 2.870162 3.234871 3.411761 3.007663 3.259600 ... 1.063162 0.228177 0.052065 5.220505 4.312462 4.840934 121.023357 4.799483 0.245668 0.060353
20946 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 5.189401 4.364761 4.795418 119.885444 4.834985 0.239255 0.057243
20947 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 5.099118 4.264608 4.695417 117.385428 4.680361 0.197033 0.038822

2688 rows × 320 columns

In [ ]:
# Plot the results
predictions = lasso.predict(features_test)
predictions_test = [max(0, i) for i in predictions]

plt.figure(figsize=(12, 8))
plt.plot(range(len(predictions_test[-750:])), df_test["production"].iloc[-750:], label="Real Values", color="blue")
plt.plot(range(len(predictions_test[-750:])), predictions_test[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Overprediction and underprediction cases are observed in test data. Also, it seems the model overpredicts more at this case compared to test data. Let's see error metrices.

In [ ]:
accu(df_test["production"], predictions_test) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 2688 2.600301 3.496188 1.344532 -0.127096 0.933208 2.224446 1.784562 0.68629 0.68629

As it is observed, the model overpredicts more at this case (FBias is more negative). Error metric values are much worse in this data set. The model have 75% error while predicting. It cannot be used in this case. Search for better models countinues.

In [ ]:
forecasted_train["Lasso"] = fitted
forecasted_test["Lasso"] = predictions_test

3b. Model III - ARIMA & Lasso Regression Model ¶

The previous two models lacked the same parts of the given data oppositely. The Lasso model lacked the relations between production data itself whereas the ARIMA model lacked weather regressors. So, combining these two in a linear model setting can work well. In this part, ARIMA and Lasso models are combined to build a new model.

In [ ]:
# Combine the results
df_combined = pd.DataFrame(data = [fitted, forecasted_values])
df_combined = df_combined.T
df_combined.columns = ['Lasso', 'Arima']
df_combined = sm.add_constant(df_combined)

df_combined
Out[ ]:
const Lasso Arima
0 1.0 0.000000 0.000000
1 1.0 0.000000 0.000000
2 1.0 0.000000 0.000000
3 1.0 0.000000 0.000000
4 1.0 0.000000 3.400000
... ... ... ...
18255 1.0 0.462583 0.019966
18256 1.0 0.261788 0.000000
18257 1.0 0.260214 0.008114
18258 1.0 0.987310 0.000000
18259 1.0 0.940808 0.003297

18260 rows × 3 columns

For combination, a linear regression model will be built.

In [ ]:
# Fit them through linear regression
lm = sm.OLS(df_train["production"], df_combined)
result = lm.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             production   R-squared:                       0.863
Model:                            OLS   Adj. R-squared:                  0.863
Method:                 Least Squares   F-statistic:                 5.747e+04
Date:                Wed, 05 Jun 2024   Prob (F-statistic):               0.00
Time:                        12:59:51   Log-Likelihood:                -31601.
No. Observations:               18260   AIC:                         6.321e+04
Df Residuals:                   18257   BIC:                         6.323e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1228      0.014     -8.902      0.000      -0.150      -0.096
Lasso          0.3050      0.006     52.523      0.000       0.294       0.316
Arima          0.7918      0.005    161.779      0.000       0.782       0.801
==============================================================================
Omnibus:                     1988.077   Durbin-Watson:                   1.510
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15131.111
Skew:                           0.235   Prob(JB):                         0.00
Kurtosis:                       7.435   Cond. No.                         7.73
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression model has 0.863 R-squared value, and it is a good value. Let's see how it works while predicting the production values.

In [ ]:
# Plot the results
fitted = [max(0, i) for i in result.fittedvalues]
fitted = [min(10, i) for i in fitted]

plt.figure(figsize=(12, 8))
plt.plot(range(len(fitted[-750:])), df_train["production"][-750:], label="Real Values", color="blue")
plt.plot(range(len(fitted[-750:])), fitted[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Real and predicted production amounts are plotted together and represented by blue and red lines respectively. This model works much better than Lasso regression model while predicting lower production values. Also, it seems the gap while predicting higher production values are smaller than both ARIMA and lasso regression models. Error metrices will give more clear result.

In [ ]:
accu(df_train["production"], fitted) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 18260 2.565885 3.688927 1.437682 -0.000509 0.297014 1.351876 0.793577 0.30928 0.30928

This model has lower RMSE, MAD, MADP, and WMAPE than the previous models. However, its MAPE value is larger than ARIMA model. It underpredicts the production amounts, but absolute FBias value is smaller than previous models. Let's see how it works on test data.

In [ ]:
# Test the model
df_combined_test = pd.DataFrame(data = [forecasted_test["Arima"], forecasted_test["Lasso"]])
df_combined_test = df_combined_test.T
df_combined_test.columns = ['Lasso', 'Arima']
df_combined_test = sm.add_constant(df_combined_test)

df_combined_test
Out[ ]:
const Lasso Arima
0 1.0 -0.000512 0.850802
1 1.0 -0.000009 0.855853
2 1.0 -0.000236 0.928610
3 1.0 -0.000008 0.940223
4 1.0 -0.000109 0.998466
... ... ... ...
2683 1.0 0.000000 0.000000
2684 1.0 0.000000 0.000000
2685 1.0 0.000000 0.000000
2686 1.0 0.000000 0.000000
2687 1.0 0.000000 0.000000

2688 rows × 3 columns

In [ ]:
# Plot the results
lm_test = result.predict(df_combined_test)
fitted_test = [max(0, i) for i in lm_test]
fitted_test = [min(10, i) for i in fitted_test] 

plt.figure(figsize=(20, 8))
plt.plot(range(len(fitted_test[-750:])), df_test["production"].iloc[-750:], label="Real Values", color="blue")
plt.plot(range(len(fitted_test[-750:])), fitted_test[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Real and predicted production amounts are plotted together and represented by blue and red lines respectively. The combined model does not seem to work well in test data. It overpredicts points where production is low, and the gap while predicting higher production values are high. Error metrices will be high but let's see.

In [ ]:
accu(df_test["production"], fitted_test) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 2688 2.600301 3.496188 1.344532 -0.150969 0.703175 1.803095 1.398366 0.537771 0.537771

The model overpredicts the production values. Its error metric values are very high and overall, its error metrices have higher values than the ARIMA model and lower values than the Lasso Regression model.

Note that this model works worse, especially in the parts where there is no production therefore in a prediction setting these parts should be given as 0 directly to improve the results.

In [ ]:
# Save the results
forecasted_train["Arima+Lasso"] = fitted
forecasted_test["Arima+Lasso"] = fitted_test

Let's plot and check the residuals to see overall performance.

In [ ]:
# Check the residuals
result.resid.plot(kind="kde", title="Residuals", figsize=(10, 5))
Out[ ]:
<Axes: title={'center': 'Residuals'}, ylabel='Density'>
No description has been provided for this image

Residuals seem to have zero mean but the shape seems narrower than the normal distribution.

In [ ]:
residual = result.resid

plt.figure(figsize = (12, 7))
plt.scatter(range(len(df_combined)),residual, label = "Residual", color = "blue", s = 10)
plt.title("Residuals of the Model")

plt.show()
No description has been provided for this image

Residuals of the model seems to have mean zero, but variance value seems high. Also, there are some outliers. Let's check aoutocorrelation for further information.

In [ ]:
plt.figure(figsize=(12.5, 7))  
plt.acorr(result.resid[:1000], maxlags=len(result.resid[:1000])-1, usevlines = False, marker='o')
plt.axhline(y=0.125, color='red', linestyle='--') 
plt.axhline(y=-0.125, color='red', linestyle='--') 
plt.title("Autocorrelation of Residuals")
plt.xlim(0, 1020)
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

3b. Model IV - Combined Model & Monthly Dummy Variables ¶

The combined model worked good in the train set, but error rates were high when it's applied to test set. Thus, improving the model with additional variables might result in our best model. The production amount has seasonality within the day, but the month which the production is done may also have an affect. The weather conditions are somewhat similar in every year, so adding monthly dummy variables might improve the model. Let's try this.

In [ ]:
# To improve the residuals' normality assumption
a = df_combined.copy()

df_combined["month"] = df_train["date"].dt.month
month_encoding = pd.get_dummies(df_combined["month"], prefix="month", dtype="int64")
df_combined = pd.concat([df_combined, month_encoding], axis=1)
df_combined.drop(columns=["month", "month_12"], inplace=True)

df_combined
Out[ ]:
const Lasso Arima month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11
0 1.0 0.000000 0.000000 1 0 0 0 0 0 0 0 0 0 0
1 1.0 0.000000 0.000000 1 0 0 0 0 0 0 0 0 0 0
2 1.0 0.000000 0.000000 1 0 0 0 0 0 0 0 0 0 0
3 1.0 0.000000 0.000000 1 0 0 0 0 0 0 0 0 0 0
4 1.0 0.000000 3.400000 1 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18255 1.0 0.462583 0.019966 1 0 0 0 0 0 0 0 0 0 0
18256 1.0 0.261788 0.000000 1 0 0 0 0 0 0 0 0 0 0
18257 1.0 0.260214 0.008114 1 0 0 0 0 0 0 0 0 0 0
18258 1.0 0.987310 0.000000 1 0 0 0 0 0 0 0 0 0 0
18259 1.0 0.940808 0.003297 1 0 0 0 0 0 0 0 0 0 0

18260 rows × 14 columns

In [ ]:
# Fit the model
lm_motnh = sm.OLS(df_train["production"], df_combined)
result_month = lm_motnh.fit()
print(result_month.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             production   R-squared:                       0.865
Model:                            OLS   Adj. R-squared:                  0.865
Method:                 Least Squares   F-statistic:                     8966.
Date:                Wed, 05 Jun 2024   Prob (F-statistic):               0.00
Time:                        12:59:52   Log-Likelihood:                -31487.
No. Observations:               18260   AIC:                         6.300e+04
Df Residuals:                   18246   BIC:                         6.311e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1820      0.036     -5.096      0.000      -0.252      -0.112
Lasso          0.3054      0.006     50.255      0.000       0.294       0.317
Arima          0.7877      0.005    158.518      0.000       0.778       0.797
month_1       -0.1619      0.045     -3.558      0.000      -0.251      -0.073
month_2        0.2506      0.051      4.902      0.000       0.150       0.351
month_3        0.0367      0.050      0.735      0.462      -0.061       0.134
month_4       -0.0820      0.051     -1.614      0.107      -0.182       0.018
month_5        0.1150      0.050      2.296      0.022       0.017       0.213
month_6        0.3700      0.050      7.362      0.000       0.271       0.469
month_7        0.2240      0.051      4.431      0.000       0.125       0.323
month_8        0.1905      0.050      3.774      0.000       0.092       0.290
month_9        0.0580      0.051      1.133      0.257      -0.042       0.158
month_10       0.0116      0.050      0.232      0.817      -0.087       0.110
month_11      -0.0666      0.050     -1.325      0.185      -0.165       0.032
==============================================================================
Omnibus:                     1994.176   Durbin-Watson:                   1.522
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15208.041
Skew:                           0.237   Prob(JB):                         0.00
Kurtosis:                       7.446   Cond. No.                         70.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The adjusted R-squared value is 0.865, and it's slightly higher than the combined model. AIC and BIC values are very low which means there's no overfitting. However, some month dummy variables have high probability values such as month_3, 4, 9, 10, 11 have 0.462, 0.107, 0.257, 0.817 and 0.185, respectively. Because of probabilities, null hypothesis cannot be rejected. And we cannot say they have influence on the production. Let's see how it predicts the production amounts.

In [ ]:
# Plot the results
fitted = [max(0, i) for i in result_month.fittedvalues]
fitted = [min(10, i) for i in fitted]

plt.figure(figsize=(20, 8))
plt.plot(range(len(fitted[-750:])), df_train["production"][-750:], label="Real Values", color="blue")
plt.plot(range(len(fitted[-750:])), fitted[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

Red and blue lines represent fitted and real values, respectively. The model seems to have a problem predicting the production amount where it's highest. It seems overpredicting and underpredicting at some cases. It works better at predicting lowest production amounts but there're some points where it overpredicts. Let's see error metrices to have more clear information.

In [ ]:
accu(df_train["production"], fitted) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 18260 2.565885 3.688927 1.437682 -0.002383 0.298079 1.340546 0.789914 0.307853 0.307853

FBias value is negative so the model overpredicts the production amounts. MAD, MADP, and WMAPE values are slightly less than the combined model. However, it's MAPE is slightly more than that. Lest's see how it works in test data set.

In [ ]:
# Test the model
b = df_combined_test.copy()

df_combined_test["month"] = df_test["date"].dt.month.tolist()
month_encoding_test = pd.get_dummies(df_combined_test["month"], prefix="month", dtype="int64")
df_combined_test = pd.concat([df_combined_test, month_encoding_test], axis=1)
df_combined_test.drop(columns=["month"], inplace=True)
for i in range(6,12):
    df_combined_test[f"month_{i}"] = 0

df_combined_test["month_1"] = 0

df_combined_test
Out[ ]:
const Lasso Arima month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_1
0 1.0 -0.000512 0.850802 1 0 0 0 0 0 0 0 0 0 0
1 1.0 -0.000009 0.855853 1 0 0 0 0 0 0 0 0 0 0
2 1.0 -0.000236 0.928610 1 0 0 0 0 0 0 0 0 0 0
3 1.0 -0.000008 0.940223 1 0 0 0 0 0 0 0 0 0 0
4 1.0 -0.000109 0.998466 1 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2683 1.0 0.000000 0.000000 0 0 0 1 0 0 0 0 0 0 0
2684 1.0 0.000000 0.000000 0 0 0 1 0 0 0 0 0 0 0
2685 1.0 0.000000 0.000000 0 0 0 1 0 0 0 0 0 0 0
2686 1.0 0.000000 0.000000 0 0 0 1 0 0 0 0 0 0 0
2687 1.0 0.000000 0.000000 0 0 0 1 0 0 0 0 0 0 0

2688 rows × 14 columns

In [ ]:
# Plot the results
lm_test = result_month.predict(df_combined_test)
fitted_test = [max(0, i) for i in lm_test]
fitted_test = [min(10, i) for i in fitted_test]

plt.figure(figsize=(20, 8))
plt.plot(range(len(fitted_test[-750:])), df_test["production"].iloc[-750:], label="Real Values", color="blue")
plt.plot(range(len(fitted_test[-750:])), fitted_test[-750:], label="Fitted Values", color="red")
plt.legend(loc="best")
plt.show()
No description has been provided for this image

The red and blue lines represent fitted and real values, respectively. The model seems to underpredict highest production amounts and gap values seem high. Nearly in all the time points plotted above, the model overpredicts lowest production amounts. It does not seem to work well, but let's see error metrices.

In [ ]:
accu(df_test["production"], fitted_test) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 2688 2.600301 3.496188 1.344532 -0.134415 0.685622 1.810888 1.387751 0.533689 0.533689

As it can be seen from above, the model overpredicts the production amount (FBias is negative). Its error values are higher than desired, but they're slightly less than the combined model. Thus, there's a little improvement.

In [ ]:
forecasted_train["Arima+Lasso+Month"] = fitted
forecasted_test["Arima+Lasso+Month"] = fitted_test
In [ ]:
# Check the residuals
result.resid.plot(kind="kde", title="Residuals", figsize=(10, 5))
Out[ ]:
<Axes: title={'center': 'Residuals'}, ylabel='Density'>
No description has been provided for this image
In [ ]:
residual = result.resid

plt.figure(figsize = (12, 7))
plt.scatter(range(len(df_combined)),residual, label = "Residual", color = "blue", s = 10)
plt.title("Residuals of the Model")

plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(12.5, 7))  
plt.acorr(result.resid[:1000], maxlags=len(result.resid[:1000])-1, usevlines = False, marker='o')
plt.axhline(y=0.125, color='red', linestyle='--') 
plt.axhline(y=-0.125, color='red', linestyle='--') 
plt.title("Autocorrelation of Residuals")
plt.xlim(0, 1020)
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

In general, residuals seem to have similiar patterns in all of the plots above.

3b. Additional Model- Hourly Model ¶

The main approach of the project is the previous model, but they do not work that well. Thus, an additional model example will be given here to show another possible way to approach this forecasting problem. It seemed that the error rates changed from hour to hour, so trying to build a model for each hour might make sense. It's a long code, so only hour 6 is here for analysis but all of it can be found in a separate code file given at the end of the report.

In [ ]:
mask = df["hour"] == 6 # Get the data for 6 am
df_hour = df[mask]
df_hour.reset_index(drop=True, inplace=True)
df_hour_train = df_hour[df_hour["date"] < train_test_split_date] 
df_hour_test = df_hour[df_hour["date"] >= train_test_split_date] 
df_hour_train.reset_index(drop=True, inplace=True)
df_hour_test.reset_index(drop=True, inplace=True)

df_hour
Out[ ]:
date hour dswrf_surface_37.75_34.5 dswrf_surface_37.75_34.75 dswrf_surface_37.75_35.0 dswrf_surface_37.75_35.25 dswrf_surface_37.75_35.5 dswrf_surface_38.0_34.5 dswrf_surface_38.0_34.75 dswrf_surface_38.0_35.0 ... tmp_surface_38.5_34.75 tmp_surface_38.5_35.0 tmp_surface_38.5_35.25 tmp_surface_38.5_35.5 tmp_surface_38.75_34.5 tmp_surface_38.75_34.75 tmp_surface_38.75_35.0 tmp_surface_38.75_35.25 tmp_surface_38.75_35.5 production
0 2022-01-01 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.796341 2.567324 2.522566 2.359434 2.846216 2.705534 2.432177 2.281041 2.198397 0.00
1 2022-01-02 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 3.267960 3.036939 2.899968 3.503786 3.595176 3.412035 3.124256 2.973101 2.701385 0.00
2 2022-01-03 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.685982 2.444194 2.271147 1.677709 2.806652 2.713992 2.389464 2.314727 2.063727 0.00
3 2022-01-04 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 3.000705 2.770820 2.665036 2.517651 3.030488 2.939584 2.695832 2.535183 2.059412 0.00
4 2022-01-05 6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.874656 2.539388 2.493577 2.127801 2.904842 2.776704 2.482527 2.329574 2.183884 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
868 2024-05-19 6 0.026045 0.028509 0.031648 0.033406 0.035293 0.027719 0.029898 0.032447 ... 4.600513 4.443268 4.304340 4.708892 4.580921 4.429676 4.266114 4.186047 4.050966 3.38
869 2024-05-20 6 0.026680 0.028928 0.028132 0.028487 0.034453 0.028139 0.030316 0.032239 ... 4.658352 4.527340 4.460274 5.033127 4.693019 4.675568 4.542454 4.390034 4.473935 1.71
870 2024-05-21 6 0.028374 0.031025 0.034130 0.035865 0.037814 0.030239 0.032407 0.035775 ... 4.480847 4.324110 4.345556 4.721682 4.724790 4.453842 4.214341 4.048807 3.920218 2.85
871 2024-05-22 6 0.031127 0.033750 0.037233 0.039144 0.040755 0.032759 0.035543 0.038479 ... 4.496669 4.366344 4.265734 4.762483 4.583198 4.444055 4.294460 4.150988 4.040375 2.92
872 2024-05-23 6 0.032397 0.035636 0.038681 0.039964 0.041806 0.034229 0.037007 0.039103 ... 4.617001 4.605322 4.486241 4.954978 4.703689 4.577574 4.515144 4.338756 4.276375 2.62

873 rows × 253 columns

In [ ]:
# Prepare the data for the model

# Monthly dummy variables
df_hour_month_train = df_hour_train["date"].dt.month
df_hour_month_train = pd.get_dummies(df_hour_month_train, prefix="month", dtype="int64")
df_hour_month_train.drop(columns=["month_12"], inplace=True)
df_hour_month_train

df_model_hour = df_hour_month_train.copy()
df_model_hour["trend"] = np.arange(len(df_model_hour)) # Trend variable
df_model_hour = sm.add_constant(df_model_hour) # Add the constant

# Features for the model
features_train = df_hour_train.copy()
features_train.drop(columns=["date", "hour", "production"], inplace=True)
lst = weather_df.columns[4:-10]
for feature in lst:
    feature_col = df_hour_train.columns[df_hour_train.columns.str.contains(feature)]
    features_train[f"mean_{feature}"] = df_hour_train[feature_col].mean(axis=1)
features_train = features_train.filter(regex=r'^mean_')
features_train

df_model_hour= pd.concat([df_model_hour, features_train], axis=1) # Combine the data

df_model_hour
Out[ ]:
const month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 ... mean_dswrf_surface mean_tcdc_low.cloud.layer mean_tcdc_middle.cloud.layer mean_tcdc_high.cloud.layer mean_tcdc_entire.atmosphere mean_uswrf_top_of_atmosphere mean_csnow_surface mean_dlwrf_surface mean_uswrf_surface mean_tmp_surface
0 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 0.2964 0.5372 1.1688 2.1772 0.0 0.0 2.534127 0.0 2.452826
1 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 9.3304 8.9960 9.9588 10.0000 0.0 6.8 5.801885 0.0 3.067450
2 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 0.9296 0.0000 0.0284 1.0084 0.0 0.0 1.400544 0.0 2.066741
3 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 0.0000 7.4092 9.3828 9.5536 0.0 0.0 3.605417 0.0 2.588966
4 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 0.0000 0.0088 1.0592 1.3740 0.0 0.0 2.388424 0.0 2.411131
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
756 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 6.8540 5.4612 1.7792 8.7156 0.0 6.0 2.982100 0.0 2.376728
757 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 6.4448 10.0000 2.2528 10.0000 0.0 2.0 3.453101 0.0 2.393940
758 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 5.7448 4.1112 0.0000 7.8704 0.0 0.8 1.968997 0.0 2.145257
759 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 3.1328 9.7120 2.6400 9.7720 0.0 1.2 3.111251 0.0 2.362636
760 1.0 1 0 0 0 0 0 0 0 0 ... 0.0 2.4536 5.9020 6.8080 8.8128 0.0 0.4 2.635682 0.0 2.388490

761 rows × 23 columns

The weather variables described in the beginning affect production amounts, but there're a lot of them due to their location. Therefore, mean of every parameter is taken. Also, month parameters improved the previous model a bit, so they can be added too. Let's see how their Regression model works.

In [ ]:
# Fit the model
lm = sm.OLS(df_hour_train["production"], df_model_hour)
result = lm.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             production   R-squared:                       0.514
Model:                            OLS   Adj. R-squared:                  0.499
Method:                 Least Squares   F-statistic:                     35.42
Date:                Wed, 05 Jun 2024   Prob (F-statistic):          6.23e-100
Time:                        12:59:53   Log-Likelihood:                -779.69
No. Observations:                 761   AIC:                             1605.
Df Residuals:                     738   BIC:                             1712.
Df Model:                          22                                         
Covariance Type:            nonrobust                                         
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           -0.1364      0.237     -0.575      0.565      -0.602       0.329
month_1                          0.1966      0.118      1.661      0.097      -0.036       0.429
month_2                          0.3679      0.141      2.609      0.009       0.091       0.645
month_3                          0.5229      0.132      3.954      0.000       0.263       0.783
month_4                          0.9512      0.143      6.659      0.000       0.671       1.232
month_5                          0.9658      0.200      4.829      0.000       0.573       1.358
month_6                          1.7577      0.295      5.956      0.000       1.178       2.337
month_7                          2.1213      0.207     10.261      0.000       1.715       2.527
month_8                          1.2350      0.183      6.757      0.000       0.876       1.594
month_9                          1.1965      0.162      7.371      0.000       0.878       1.515
month_10                         0.6869      0.143      4.798      0.000       0.406       0.968
month_11                         0.1890      0.128      1.471      0.142      -0.063       0.441
trend                            0.0016      0.000     11.985      0.000       0.001       0.002
mean_dswrf_surface             -26.2331     37.378     -0.702      0.483     -99.614      47.147
mean_tcdc_low.cloud.layer       -0.0521      0.024     -2.173      0.030      -0.099      -0.005
mean_tcdc_middle.cloud.layer    -0.0092      0.015     -0.614      0.539      -0.039       0.020
mean_tcdc_high.cloud.layer       0.0024      0.016      0.151      0.880      -0.029       0.033
mean_tcdc_entire.atmosphere     -0.0188      0.019     -0.968      0.333      -0.057       0.019
mean_uswrf_top_of_atmosphere    15.2610      5.679      2.687      0.007       4.112      26.410
mean_csnow_surface               0.0366      0.022      1.650      0.099      -0.007       0.080
mean_dlwrf_surface               0.1746      0.093      1.871      0.062      -0.009       0.358
mean_uswrf_surface              17.3852     61.202      0.284      0.776    -102.766     137.536
mean_tmp_surface                -0.3867      0.159     -2.430      0.015      -0.699      -0.074
==============================================================================
Omnibus:                       70.193   Durbin-Watson:                   1.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              131.814
Skew:                           0.589   Prob(JB):                     2.38e-29
Kurtosis:                       4.664   Cond. No.                     1.27e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.27e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The model's adjusted R-squared value is 0.510 (not a good value) and its BIC and AIC values are high. Also, some independent variables have high probability values such as mean_tcdc_high.cloud.layer, mean_tcdc_middle.cloud.layer, and mean_uswrf_surface. Because they seem not to have effect on the model, they can be dropped. Also, because data has seasonality, adding lagged values may improve it. These operations will be done and moving average term will be added for smoothşng. Let's see how it works.

In [ ]:
production_hour = df_hour_train["production"]

acf_values = acf(production_hour, nlags=24*5)
plt.figure(figsize=(20, 5))
plt.plot(range(len(acf_values)), acf_values, marker='o', linestyle='-')
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image
In [ ]:
production_hour = df_hour_train["production"]

acf_values = pacf(production_hour, nlags=24*5)
plt.figure(figsize=(20, 5))
plt.plot(range(len(acf_values)), acf_values, marker='o', linestyle='-')
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
No description has been provided for this image

Data has a lot of autocorrelation and partial autocorrelation. It seems that the model can be improved by adding lagged values and moving average term. Let's see how it works.

In [ ]:
# Drop unnecessary columns
drop_lst_hour_6 = ["mean_dswrf_surface", "mean_tcdc_middle.cloud.layer", "mean_tcdc_high.cloud.layer", "mean_uswrf_surface"]

df_model_hour.drop(columns = drop_lst_hour_6, inplace = True)

df_model_hour["lag_2"] = df_hour_train["production"].shift(2, fill_value=0) # Lag 2
df_model_hour["ma_3"] = df_hour_train["production"].rolling(window=3).mean() # Moving Average 3
df_model_hour.bfill(inplace=True) # Fill the missing values if there are any
In [ ]:
# Fit the model
lm = sm.OLS(df_hour_train["production"], df_model_hour)
model_hour_6 = lm.fit()
print(model_hour_6.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             production   R-squared:                       0.874
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     255.7
Date:                Wed, 05 Jun 2024   Prob (F-statistic):          5.25e-316
Time:                        12:59:53   Log-Likelihood:                -266.92
No. Observations:                 761   AIC:                             575.8
Df Residuals:                     740   BIC:                             673.2
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                            0.0812      0.119      0.681      0.496      -0.153       0.315
month_1                          0.0035      0.060      0.058      0.954      -0.114       0.121
month_2                         -0.0005      0.072     -0.006      0.995      -0.142       0.141
month_3                          0.0182      0.068      0.268      0.789      -0.115       0.151
month_4                          0.0444      0.074      0.596      0.551      -0.102       0.191
month_5                         -0.0111      0.103     -0.108      0.914      -0.213       0.191
month_6                         -0.0454      0.154     -0.295      0.768      -0.347       0.257
month_7                         -0.0363      0.116     -0.312      0.755      -0.265       0.192
month_8                          0.0078      0.095      0.082      0.935      -0.178       0.194
month_9                          0.0124      0.085      0.145      0.885      -0.155       0.180
month_10                         0.0088      0.073      0.120      0.904      -0.135       0.152
month_11                         0.0049      0.065      0.076      0.940      -0.123       0.133
trend                         4.044e-05   7.79e-05      0.519      0.604      -0.000       0.000
mean_tcdc_low.cloud.layer       -0.0056      0.010     -0.566      0.571      -0.025       0.014
mean_tcdc_entire.atmosphere     -0.0056      0.006     -0.917      0.359      -0.018       0.006
mean_uswrf_top_of_atmosphere     1.2773      1.469      0.870      0.385      -1.606       4.160
mean_csnow_surface               0.0003      0.011      0.027      0.979      -0.021       0.021
mean_dlwrf_surface               0.0338      0.043      0.791      0.429      -0.050       0.118
mean_tmp_surface                -0.0616      0.076     -0.808      0.419      -0.211       0.088
lag_2                           -0.5405      0.031    -17.466      0.000      -0.601      -0.480
ma_3                             1.5290      0.039     39.689      0.000       1.453       1.605
==============================================================================
Omnibus:                      144.630   Durbin-Watson:                   2.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2776.416
Skew:                           0.205   Prob(JB):                         0.00
Kurtosis:                      12.348   Cond. No.                     5.13e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

After dropping the variables, adjusted R-squared value became 0.868 and it improved a lot. AIC nad BIC values seem to decrease by 2/3. However, as it can be seen from probability values, the model uses lag 2 and MA(3) the most. Other month and weather variables seem to have not much effect at this case. Relying on lagged values are not desired because it may result in accumulation of errors but let's check how it works at predicting. (Real and fitted values plot is not given because it does not show much information like in the previous cases)

In [ ]:
accu(df_hour_train["production"], model_hour_6.predict()) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 761 0.572234 0.966553 1.689087 -1.045058e-14 0.081871 0.34363 0.136161 0.237946 0.237946

The model underpredicts but at very low rate. It's error metrices are better than the previous models. However, these error rates belong to only one hour while the others gave a total value. Let's see how it works in the test data set.

In [ ]:
# Prepare the test data
df_hour_month_test = df_hour_test["date"].dt.month
df_hour_month_test = pd.get_dummies(df_hour_month_test, prefix="month", dtype="int64")
for i in range(5,12):
    df_hour_month_test[f"month_{i}"] = 0

df_hour_month_test["model_1"] = 0

df_test_ = df_hour_month_test.copy()
df_test_["trend"] = np.arange(len(df_test_))
df_test_ = sm.add_constant(df_test_)

features_test = df_hour_test.copy()
features_test.drop(columns=["date", "hour", "production"], inplace=True)
lst = weather_df.columns[4:-10]
for feature in lst:
    feature_col = df_hour_test.columns[df_hour_test.columns.str.contains(feature)]
    features_test[f"mean_{feature}"] = df_hour_test[feature_col].mean(axis=1)
features_test = features_test.filter(regex=r'^mean_')
features_test

df_test_.reset_index(drop=True, inplace=True)
features_test.reset_index(drop=True, inplace=True)
df_test_ = pd.concat([df_test_, features_test], axis=1)

df_test_.drop(columns = drop_lst_hour_6, inplace = True)

df_test_["lag_2"] = df_hour_test["production"].shift(2, fill_value=0)
df_test_["ma_3"] = df_hour_test["production"].rolling(window=3).mean()
df_test_.bfill(inplace=True)

df_test_
Out[ ]:
const month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 ... model_1 trend mean_tcdc_low.cloud.layer mean_tcdc_entire.atmosphere mean_uswrf_top_of_atmosphere mean_csnow_surface mean_dlwrf_surface mean_tmp_surface lag_2 ma_3
0 1.0 1 0 0 0 0 0 0 0 0 ... 0 0 1.4452 4.8652 0.000000 0.0 1.951960 2.320893 0.00 0.010000
1 1.0 1 0 0 0 0 0 0 0 0 ... 0 1 0.4480 0.5220 0.000000 0.0 1.259384 2.148027 0.00 0.010000
2 1.0 1 0 0 0 0 0 0 0 0 ... 0 2 0.2132 4.9336 0.000000 0.0 2.148184 2.364361 0.01 0.010000
3 1.0 1 0 0 0 0 0 0 0 0 ... 0 3 3.0724 9.9048 0.000000 0.0 3.036168 2.598037 0.01 0.030000
4 1.0 1 0 0 0 0 0 0 0 0 ... 0 4 8.4896 9.9736 0.000000 4.4 4.732233 2.873469 0.01 0.026667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107 1.0 0 0 0 0 0 0 0 0 0 ... 0 107 2.6604 7.4028 0.055186 0.0 5.848421 4.482993 2.43 2.880000
108 1.0 0 0 0 0 0 0 0 0 0 ... 0 108 4.1696 6.7316 0.059573 0.0 6.336136 4.663188 2.83 2.640000
109 1.0 0 0 0 0 0 0 0 0 0 ... 0 109 0.4836 3.4784 0.057561 0.0 5.202857 4.431244 3.38 2.646667
110 1.0 0 0 0 0 0 0 0 0 0 ... 0 110 0.0348 0.4372 0.056268 0.0 5.025079 4.407368 1.71 2.493333
111 1.0 0 0 0 0 0 0 0 0 0 ... 0 111 0.3552 4.0112 0.059798 0.0 5.380110 4.563517 2.85 2.796667

112 rows × 21 columns

In [ ]:
accu(df_hour_test["production"], model_hour_6.predict(df_test_)) # Calculate the error metrics
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
0 112 1.301607 1.133375 0.87075 0.037519 0.09922 0.405489 0.230922 0.177413 0.177413

The model seems to underpredict the production amounts. It's MAD, MADP, WMAPE, and MAPE values are lower than the previous models. However, because these values are from only 1 hour and the model depends mostly on lagged values, it is not chosen. As it is told previously, this model is here to show a different approach from our main one.

4. Comparison ¶

Comparisons are made throughout the report, but seeing all results together and having additional plots might give a more comprehensive analysis. Let's start with error metrices of the main approach. (Hourly model will not included here because it's an additional work not the main approach.)

In [ ]:
forecasted_train_df = pd.DataFrame(forecasted_train)

accuracy_train = []

# Calculate the error metrics for the training data
for col in forecasted_train_df.columns:
    accuracy_train.append(accu(df_train["production"], forecasted_train_df[col]))

accuracy_train_df = pd.concat(accuracy_train, ignore_index=True)
accuracy_train_df.index = forecasted_train_df.columns

accuracy_train_df
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
Arima 18260 2.565885 3.688927 1.437682 0.075443 0.213338 1.478055 0.808813 0.315218 0.315218
Lasso 18260 2.565885 3.688927 1.437682 -0.035382 0.703112 2.137464 1.499716 0.584483 0.584483
Arima+Lasso 18260 2.565885 3.688927 1.437682 -0.000509 0.297014 1.351876 0.793577 0.309280 0.309280
Arima+Lasso+Month 18260 2.565885 3.688927 1.437682 -0.002383 0.298079 1.340546 0.789914 0.307853 0.307853

ARIMA model underpredicts while the other models overpredict the production amount. Also, ARIMA model has the lowest MAPE value followed by ARIMA and Lasso combined model, the combined model with additional month variable, and Lasso regresion alone. For RMSE, MAD, MADP, and WMAPE the combined model with month variable works the best and the other are as follows: the combined model, ARIMA and Lasso regression alone. Overall, the combined model with additional month variable seems to wotk better than others. An additional observation is lasso regression works really bad compared to others. WMAPE is the error metric used in the competition phase, so let's plot it below.

In [ ]:
wmape_train = accuracy_train_df["WMAPE"]

plt.figure(figsize=(12, 8))
plt.bar(wmape_train.index, wmape_train, color='blue')
plt.title("WMAPE for Training Data")
plt.show()
No description has been provided for this image

ARIMA, the combined model and the combined model with month variable have very similar WMAPE values. Let's check MAPE too.

In [ ]:
production_train = df_train["production"]

denominator = sum(production_train)

mape_values = {}

# Iterate over each forecast model and calculate MAPE
for model_name in forecasted_train_df.columns:
    forecast_values = forecasted_train_df[model_name]
    mape_lst = []
    for i in range(len(forecast_values)):
        mape_lst.append(np.abs(production_train[i] - forecast_values[i]) / denominator)
    mape_values[model_name] = mape_lst

# Create box plot for all MAPE values
plt.figure(figsize=(10, 6))
plt.boxplot(mape_values.values(), tick_labels=mape_values.keys())  
plt.title("Box Plot of WMAPE for All Forecast Models")
plt.xlabel("Forecast Models")
plt.ylabel("WMAPE")
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

As it can be seen from the box plots, all models are positivelt skewed. (Means are closer to left-end) Also, ARIMA model have the lowest mean, and it is followed by the combined model / combined model + month variable (they have nearly the same mean), and finally lasso regression. However, ARIMA model has a lot of outliers and the furthest outlier, so its errors might change a lot. The combined model and the combined model with month variables seem to have a similar pattern. They and Lasso regression have outliers until nearly the same point. But, Lasso regression's median-max range is longer than them. All models have the same minimum value, and Lasso regression has the highest maximum value and is followed by ARIMA, and the combined models.

Let's check performance on the test set, too.

In [ ]:
forecasted_test_df = pd.DataFrame(forecasted_test)

accuracy_test = []

# Calculate the error metrics for the test data
for col in forecasted_test_df.columns:
    accuracy_test.append(accu(df_test["production"], forecasted_test[col]))

accuracy_test_df = pd.concat(accuracy_test, ignore_index=True)
accuracy_test_df.index = forecasted_test_df.columns

accuracy_test_df
Out[ ]:
n mean sd CV FBias MAPE RMSE MAD MADP WMAPE
Arima 2688 2.600301 3.496188 1.344532 0.002229 0.198583 1.919275 0.947456 0.364364 0.364364
Lasso 2688 2.600301 3.496188 1.344532 -0.127096 0.933208 2.224446 1.784562 0.686290 0.686290
Arima+Lasso 2688 2.600301 3.496188 1.344532 -0.150969 0.703175 1.803095 1.398366 0.537771 0.537771
Arima+Lasso+Month 2688 2.600301 3.496188 1.344532 -0.134415 0.685622 1.810888 1.387751 0.533689 0.533689

ARIMA model underpredicts while the other overpredicts as in the train set. However, the case is different at MAD, MADP, WMAPE values. ARIMA works best at there, and is followed by the combined model + month, the combined model, and lastly lasso. As a whole, error metrices have higher values at the test set, but the performances of the combined models seem to decrease the most.

In [ ]:
wmape_test = accuracy_test_df["WMAPE"]

plt.figure(figsize=(12, 8))
plt.bar(wmape_test.index, wmape_test, color='blue')
plt.title("WMAPE for Test Data")
plt.show()
No description has been provided for this image

ARIMA has the lowest WMAPE as it's said above and lasso regression model has the highest value.

In [ ]:
production_test = df_test["production"]

mape_values = {}

# Iterate over each forecast model and calculate MAPE
for model_name in forecasted_test_df.columns:
    forecast_values = forecasted_test_df[model_name]
    mape_lst = []
    # Ensure that production_test and forecast_values have the same length
    min_length = min(len(production_test), len(forecast_values))
    for actual, forecast in zip(production_test[:min_length], forecast_values[:min_length]):
        mape_lst.append(np.abs(actual - forecast) / sum(production_test))
    mape_values[model_name] = mape_lst

# Create box plot for all MAPE values
plt.figure(figsize=(10, 6))
plt.boxplot(mape_values.values(), tick_labels=mape_values.keys())  
plt.title("Box Plot of WMAPE for All Forecast Models (Test Data)")
plt.xlabel("Forecast Models")
plt.ylabel("WMAPE")
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

As in the train set data, all models are positively skewed. ARIMA model has the lowest mean and is followed by the combined models and lastly Lasso regression model. ARIMA model still has the furthest outlier but now the other models' furthest outliers seem close to it. It is followed by lasso, the combined model with month variables and lastly the combined model. The combined models seem very similar other than their furthest outliers. All models have the same minimum value, and the Lasso regression model has the highest maximum value. And it's followed by the combined models and ARIMA.

5. Conclusion and Future Work ¶

In conclusion, solar power has an importance role at producing energy, but because there're a lot of external factors which affect its performance, an effective forecasting model should be built. To achieve this objective, firstly an ARIMA model is built. Because it did not work very well, other methods were tried. Because there are a lot of variables, a penalized regression model is built to select among them. However, this model worked worse than ARIMA. Combination of different models might result in a better model, so the combined model is built. It worked slightly better than ARIMA in the error metrices sense. Adding dummy month variables improved a bit too. However, the combined model with month dummy variables have outlier further than the combined model. And they seem to have very similar results in error metrices. Therefore, the combined model is the final model. In the competition phase, because we wanted to see how they really work in real-life extent, all models are used in different days. Thus, WMAPE values changed a lot from day-to-day.

As it can be seen from the results, this approach did not work that well. To show another approach, an hourly model is showed above, but it has high error rates and depend mostly on lagged values. Relaying on lagged variables might be dangerous in forecasting setting and result in accumulation of errors. Another approach might be improving ARIMA model. It worked slightly worse than the combined model. Rather than combining it with penalized regression, variables affecting the production amount might be added such as monthly/daily dummy variables. Building very good hourly models can be a way too, but there're more than one models in that case, and all of their error in total create the daily error. Thus, hourly models should have smaller errors.

The main approach of this project and other ways to reach it are described above. The discussion finishes here and the related code files of the project can be found below.

6. Code ¶

Click here to see the Lasso+ARIMA model: Lasso+ARIMA Code

Click here to see the hourly model: Hourly Model Code